Abstract

Given the vast reservoirs of data stored worldwide, effi- 
cient mining of data from a large information store has 
emerged as a great challenge. Many databases like that 
of intrusion detection systems, web-click records, player 
statistics, texts, proteins etc., store strings or sequences. 
Searching for an unusual pattern within such long strings 
of data has emerged as a requirement for diverse appli- 
cations. Given a string, the problem then is to identify 
the substrings that differ the most from the expected or
normal behavior, i.e., the substrings that are statistically 
significant. In other words, these substrings are less likely 
to occur due to chance alone and may point to some inter- 
esting information or phenomenon that warrants further 
exploration. To this end, we use the chi-square measure. 
We propose two heuristics for retrieving the top-k sub- 
strings with the largest chi-square measure. We show that 
the algorithms outperform other competing algorithms in 
the runtime, while maintaining a high approximation ratio 
of more than 0.96. 

1 Motivation 

A recent attractive area of research has been the detection
of statistically relevant sequences or mining interesting
patterns from within a given string [?]. Given an
input string composed of symbols from a defined alpha- 
bet set with a probability distribution defining the chance 
of occurrence of the symbols, and thereby defining its ex- 
pected composition, we would like to find the portions of 
the string which deviate from the expected behavior and 
can thus be potent sources of study for hidden pattern and 
information. Automated monitoring systems, such as a cluster
of sensors measuring the ambient temperature for fire alerts, or a
connection server sniffing the network for possible intrusions, are
a few of the applications where such pattern detection is
essential. Other applications involve text analysis of e-mails
and blogs to predict terrorist activities or to judge
prevalent public sentiments, studying trends of the stock 
market, and identifying sudden changes in the mutation 
characteristics of protein sequence of an organism. Simi- 
larly, information extracted from a series of Internet web- 
sites visited, the advertisements clicked on them or from 
the nature of transactions on a database, can capture the 
interests of the end user, prospective clients and also the 
periods of heavy traffic in the system. An interesting field 
of application can be the identification of good and bad 
career patches of a sports icon. For example, given the 
runs scored by Sachin Tendulkar in each innings of his 
one-day international cricket career, we may be interested 
in finding his in-form and off-form patches. 

Quantifying a substring or an observation as unex- 
pected under a given circumstance relies on the proba- 
bilistic analysis used to model the deviation of the behav- 
ior from its expected nature. Such an outcome that devi- 
ates from the expected, then becomes interesting and may 
reveal certain information regarding the source and nature 
of the variance, and we are interested in detecting such 
pockets of hidden data within substrings of an input string. 
A statistical model is used to determine the relationship of 
an experimental or observed outcome with the factors af- 
fecting the system, or to establish the occurrence as pure 
chance. An observation is said to be statistically signif- 
icant if its presence cannot be attributed to randomness 
alone. For example, within a large DNA sequence, the
recognition of highly variable patterns involves probability
matching with large fluctuations, so predicting their
locations requires self-consistent statistical procedures [?].

The degree of uniqueness of a pattern can be captured
by several measures, including the p-value and the
z-score [?]. For evaluating the significance of a
substring, it has been shown that the p-value provides
a more precise conclusion than the
z-score [4]. However, computing the p-value entails
enumerating all the possible outcomes, which can be
exponential in number, thus rendering it impractical. So,
heuristics based on branch-and-bound techniques have
been proposed [3]. The log-likelihood ratio, $G^2$ [?],
provides such a measure based on the extent of deviation
of the substring from its expected nature. For multinomial
models, the $\chi^2$ statistic approximates the importance of a
string more closely than the $G^2$ statistic [?]. Existing
systems for intrusion detection use multivariate process
control techniques such as Hotelling's $T^2$ measure [5],
which is again computationally intensive. The chi-square
measure, on the other hand, provides an easy way to
closely approximate the p-value of a sequence [?].
To simplify computations, the $\chi^2$ measure, unlike
Hotelling's method, does not consider relationships among
multiple variables, but is as effective in identifying "abnormal"
patterns [11]. Thus, in this paper, we use Pearson's $\chi^2$
statistic as a measure of the p-value of a substring [8], [9].
The $\chi^2$ distribution is characterized by the degrees of
freedom, which in the case of a string is the number of
symbols in the alphabet set minus one. The larger the
$\chi^2$ value of a string, the smaller its p-value, and hence
the greater its deviation from the expected behavior. So,
essentially, our problem reduces to finding the substring
that has the maximum $\chi^2$ value. We propose to extract
such substrings efficiently.

Related Work: 

Formally, given a string $S$ composed of symbols from
the alphabet set $\Sigma$ with a given probability distribution
$P$ modeling the chance of occurrence of each symbol, the
problem is to identify and extract the top-k substrings having
the maximum chi-square value, i.e., the largest deviation
within the framework of the p-value measure for the given
probability distribution of the symbols. Naively, we can
compute the $\chi^2$ value of all the substrings present in $S$ and
determine the top-k substrings in $O(l^2)$ time for a string of
length $l$ (see Algorithm 1). The blocking algorithm and its
heap variant proposed in [1] reduce the practical running
time for finding such statistically important substrings, but
suffer from a high worst-case running time. The number
of blocks found by this strategy increases with the size
of the alphabet set and also when the probabilities of
occurrence of the symbols are nearly equal. In such scenarios,
the number of blocks formed can be almost equal
to the length of the given string, thereby degenerating the



Algorithm 1 Naive Algorithm

Input: String $S$ with the probability of occurrence of
each symbol in the alphabet set.
Output: Top-k substrings having the maximum $\chi^2$ value.

1: Extract all the substrings in $S$.
2: Compute the $\chi^2$ value of all the substrings.
3: Return the substrings having the top-k $\chi^2$ values.



algorithm to that of the naive one. The heap variant requires
a large amount of storage for maintaining the separate
max and min heap structures and also manipulates a large
number of pointers. Further, the algorithm does not easily
generalize beyond static input strings, and cannot handle
top-k queries. In time-series databases, categorizing
a pattern as surprising based on its frequency of occurrence
and mining it efficiently using suffix trees has been
proposed in [6]. However, the $\chi^2$ measure, as discussed
earlier, seems to provide a better parameter for judging
whether a pattern is indeed interesting.
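The naive baseline of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names `chi_square` and `naive_top_k`, the dictionary representation of the symbol probabilities, and the use of a size-k min-heap are assumptions made for the example. Positions are reported as 1-based, inclusive pairs.

```python
import heapq
from collections import Counter

def chi_square(counts, length, probs):
    """Pearson chi-square of a substring, given its symbol counts and length."""
    return sum((p * length - counts[sym]) ** 2 / (p * length)
               for sym, p in probs.items())

def naive_top_k(s, probs, k):
    """Score every substring (O(l^2) substrings) and keep the k best.

    Returns a list of (chi-square, start, end) tuples, best first.
    """
    heap = []  # min-heap holding at most k entries
    for start in range(len(s)):
        counts = Counter()
        for end in range(start, len(s)):
            counts[s[end]] += 1  # extend the substring by one symbol
            item = (chi_square(counts, end - start + 1, probs),
                    start + 1, end + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

Each score is recomputed in time proportional to the alphabet size, so the sketch runs in $O(m l^2)$ time overall; it is meant as a correctness reference for the heuristics rather than as an efficient procedure.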

In this paper, we propose two algorithms, All-Pair Refined
Local Maxima Search (ARLM) and Approximate
Greedy Maximum Maxima Search (AGMM), to efficiently
search and identify interesting patterns within a string.
We show that the running times of the algorithms are far
better than those of the existing algorithms, with lower space
requirements. The procedures can also be easily extended
to work in streaming environments. ARLM, an algorithm
quadratic in the number of local maxima found in the input
string, and AGMM, a linear time algorithm, both use
the presence of local maxima in the string. We show that
the approximation ratio of the reported results to the actual
is 0.96 or more. Empirical results emphasize that the
algorithms work efficiently.

The outline of the paper is as follows: Section 2 formulates
the properties and behavior of strings under the
$\chi^2$ measure. Section 3 describes the two proposed algorithms
along with their runtime complexity analysis. Section 4
shows the experimental results on real
and synthetic data, before Section 5 concludes the paper.

2 Definition and Properties 

Let $str = s_1 s_2 \ldots s_l$ be a given string of length $l$ composed
of symbols $s_i$ taken from the alphabet set
$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_m\}$, where $|\Sigma| = m$. To each symbol
$\sigma_i \in \Sigma$ is associated a $p_{\sigma_i}$ (henceforth represented as
$p_i$), denoting the probability of occurrence of that symbol,
such that $\sum_{i=1}^{m} p_i = 1$. Let $\theta_{\sigma_i,str}$ (henceforth represented
as $\theta_{i,str}$) denote the observed number of the symbol
$\sigma_i$ in the string $str$, where $\sigma_i \in \Sigma$ and $str \in \Sigma^*$.

The chi-square value of a string $str \in \Sigma^*$ of length $l$ is
computed as

$$\chi^2_{str} = \sum_{i=1}^{m} \frac{(p_i l - \theta_{i,str})^2}{p_i l} \tag{1}$$

The chi-square measure thus calculates the deviation of
the composition of the string from its expected nature by
computing the sum of the normalized square of difference
of the observed value of each symbol in the alphabet set
from the expected value of occurrence.

Observation 1. Under string concatenation operation (.),
for two arbitrary strings $a$ and $b$ drawn from the same
alphabet set and probability distribution of the symbols
(henceforth referred to as the same universe), the $\chi^2$ measure
of the concatenated string is commutative in the order
of concatenation.

Proof. It is easy to observe that the lengths of $a.b$ and $b.a$
are the same. Further, the observed values of the different
symbols and their probabilities of occurrence are the same
in both the concatenated strings. Hence, $\chi^2_{a.b}$ is equal
to $\chi^2_{b.a}$ according to Eq. (1). □

Lemma 1. The $\chi^2$ value of the concatenation of two
strings drawn from the same universe is less than or equal
to the sum of the $\chi^2$ values of the individual strings.

Proof. Let $a$ and $b$ be two strings, of length $l_a$ and $l_b$
respectively. Let $a.b$ form the concatenated string having
length $(l_a + l_b)$. Using Eq. (1), the sum of the chi-square
values of the strings is

$$\chi^2_a + \chi^2_b = \sum_{i=1}^{m} \left( \frac{(p_i l_a - \theta_{i,a})^2}{p_i l_a} + \frac{(p_i l_b - \theta_{i,b})^2}{p_i l_b} \right) \tag{2}$$

$$\chi^2_{ab} = \sum_{i=1}^{m} \frac{(p_i (l_a + l_b) - \theta_{i,ab})^2}{p_i (l_a + l_b)} \tag{3}$$

Using $\theta_{i,ab} = \theta_{i,a} + \theta_{i,b}$ and Eqs. (2) and (3), we have

$$\chi^2_a + \chi^2_b - \chi^2_{ab} = \sum_{i=1}^{m} \left( \frac{(p_i l_a - \theta_{i,a})^2}{p_i l_a} + \frac{(p_i l_b - \theta_{i,b})^2}{p_i l_b} - \frac{\left((p_i l_a - \theta_{i,a}) + (p_i l_b - \theta_{i,b})\right)^2}{p_i (l_a + l_b)} \right)$$

$$= \sum_{i=1}^{m} \left( \frac{(p_i l_a - \theta_{i,a})^2 \, l_b}{p_i l_a (l_a + l_b)} + \frac{(p_i l_b - \theta_{i,b})^2 \, l_a}{p_i l_b (l_a + l_b)} - \frac{2 (p_i l_a - \theta_{i,a})(p_i l_b - \theta_{i,b})}{p_i (l_a + l_b)} \right)$$

$$= \frac{1}{l_a l_b} \sum_{i=1}^{m} \left( \frac{(p_i l_a - \theta_{i,a}) \, l_b}{\sqrt{p_i (l_a + l_b)}} - \frac{(p_i l_b - \theta_{i,b}) \, l_a}{\sqrt{p_i (l_a + l_b)}} \right)^2 \geq 0$$

Therefore, $\chi^2_a + \chi^2_b \geq \chi^2_{ab}$. □

Lemma 2. The chi-square value of a string composed of
only a single type of symbol increases with the length of
the string.

Proof. Let $str$ be a string of length $l$ composed only of the
symbol $\sigma_j$ drawn from the alphabet set $\Sigma$. Here, $\theta_{i,str} = 0$,
$\forall i \in \{1, 2, \ldots, m\}, i \neq j$, and $\theta_{j,str} = l$, as $str$ consists
only of $\sigma_j$. Substituting the values in Eq. (1), we have

$$\chi^2_{str} = \frac{(p_j l - l)^2}{p_j l} + \sum_{i=1, i \neq j}^{m} \frac{(p_i l)^2}{p_i l} = \frac{l (1 - p_j)^2}{p_j} + l \sum_{i=1, i \neq j}^{m} p_i \tag{4}$$

If the length of $str$ is increased by one, by including another
$\sigma_j$, its chi-square value becomes

$$\chi^2_{str'} = \frac{(p_j (l+1) - (l+1))^2}{p_j (l+1)} + \sum_{i=1, i \neq j}^{m} \frac{(p_i (l+1))^2}{p_i (l+1)} = \frac{(l+1)(1 - p_j)^2}{p_j} + (l+1) \sum_{i=1, i \neq j}^{m} p_i \tag{5}$$

Comparing Eq. (4) with Eq. (5), we observe that
the chi-square value increases, since $p_i > 0, \forall i \in \{1, 2, \ldots, m\}$. □

With this setting, we now define the term local maxima
and describe the procedure for finding such a local maxima
within a given string.
Definition 1 (Local maxima). A local maxima is a substring
such that, while traversing through it, the inclusion
of the next symbol does not decrease the $\chi^2$ value of the
resultant sequence.

Let $s_1 s_2 \ldots s_n$ be a local maxima of length $n$, where
$s_i \in \Sigma, \forall i$. Then the following holds:

$$\chi^2_{s_1 s_2} \geq \chi^2_{s_1}, \quad \chi^2_{s_1 s_2 s_3} \geq \chi^2_{s_1 s_2}, \quad \ldots, \quad \chi^2_{s_1 s_2 \ldots s_n} \geq \chi^2_{s_1 s_2 \ldots s_{n-1}}$$



The process of finding the local maxima involves a sin- 
gle scan of the entire string. We consider the first local 
maxima to start at the beginning of the given string. We 
keep appending the next symbol to the current substring 
until there is a decrease in the chi-square value of the new 
substring. The present substring is then considered to be a 
local maxima ending at the previous position and the last 
symbol appended signifies the start of the next local maxima.
Thus, by traversing through the entire string once, we
find all the local maxima present, which takes $O(l)$ time
for a string of length $l$.

As an example, consider the string $str = aaaabbba$,
having $\Sigma = \{a, b\}$ with the probability of occurrence of
symbol $a$ as 0.2, and that of $b$ as 0.8. Starting from the
beginning, we compute the $\chi^2$ value of $a$ to be 4. This is
considered to be the start of a local maxima. Appending
the next symbol, the chi-square value of $aa$ increases
to 8. Since the score increases, the current local maxima
is updated to $aa$. We keep appending the next symbol
to the current maxima. We find that $\chi^2_{aaaa} = 16$ and
$\chi^2_{aaaab} = 11.25$. As there is a decrease in the chi-square
value of the substring after insertion of $b$, the current local
maxima becomes $aaaa$ and the next local maxima is
said to begin at $b$. Repeating this procedure for the entire
string $str$, the local maxima found are $aaaa$, $bbb$ and $a$.
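The single-scan procedure and the numbers in this example can be checked with a short Python sketch. The function names and the dictionary of symbol probabilities are assumptions made for illustration; the scan follows the description above, closing the current local maxima as soon as appending a symbol lowers the chi-square value. The input is assumed to be non-empty.

```python
from collections import Counter

def chi_square_counts(counts, n, probs):
    """Chi-square value of Eq. (1) from symbol counts and string length n."""
    return sum((p * n - counts[sym]) ** 2 / (p * n)
               for sym, p in probs.items())

def chi_square(s, probs):
    """Chi-square value of a non-empty string s."""
    return chi_square_counts(Counter(s), len(s), probs)

def local_maxima(s, probs):
    """Split s into its local maxima with one left-to-right scan."""
    maxima, start, counts, prev = [], 0, Counter(), None
    for i, sym in enumerate(s):
        counts[sym] += 1
        score = chi_square_counts(counts, i - start + 1, probs)
        if prev is not None and score < prev:
            # The value dropped: close the current local maxima just
            # before position i and start a new one at this symbol.
            maxima.append(s[start:i])
            start, counts = i, Counter(sym)
            score = chi_square_counts(counts, 1, probs)
        prev = score
    maxima.append(s[start:])
    return maxima
```

For instance, `local_maxima("aaaabbba", {"a": 0.2, "b": 0.8})` splits the example string into `aaaa`, `bbb` and `a`, and `chi_square("aaaab", ...)` evaluates to 11.25 under the same probabilities.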

Lemma 3. The expected number of local maxima present
in a string of length $l$ is $O(l)$.

Proof. From Lemma 2 we can observe that, in a local
maxima, if two adjacent symbols are the same, then the
chi-square value cannot decrease. Thus the current local
maxima may end only when a pair of adjacent symbols are
different. We would like to find the expected number of
positions in the string where such a boundary may exist.
Let us define an indicator variable $x_i$, where $x_i = 1$ if the
$i$-th and the $(i+1)$-th symbols in the string are dissimilar,
and $x_i = 0$ otherwise. Let $X = \sum_{i=1}^{l-1} x_i$, where $E[X]$
gives the expected number of local maxima boundaries
for a string of length $l$. $P(x_i = 1)$ denotes the probability
of the event $x_i = 1$. Therefore,

$$P(x_i = 1) = \sum_{\forall j,k,\; j \neq k} p_j p_k = 2 \times \sum_{\forall j,k,\; j < k} p_j p_k \qquad [\text{where } j, k \in \{1, 2, \ldots, m\}]$$

$$E[X] = E\left[\sum_{i=1}^{l-1} x_i\right] = \sum_{i=1}^{l-1} E[x_i] \qquad [\text{Linearity of expectation}]$$

$$= \sum_{i=1}^{l-1} P(x_i = 1) = 2 \times \sum_{i=1}^{l-1} \sum_{\forall j,k,\; j < k} p_j p_k = 2 \times (l - 1) \times \sum_{\forall j,k,\; j < k} p_j p_k \tag{6}$$

Hence, the expected number of local maxima is $O(l)$ for
a string of length $l$. □



However, in practice the number of local maxima will
be much less than $l$, as not all adjacent positions of dissimilar
symbols correspond to a local maxima boundary.
Using Eq. (6), for $m = 2$ the maximum number of
expected local maxima is $(l - 1)/2$, and it is $2(l - 1)/3$ for
$m = 3$, which is obtained by substituting the maximum
possible value of $P(x_i = 1)$.
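The bound of Eq. (6) is easy to evaluate numerically. The following Python sketch is an illustration only: the function name `expected_boundaries` is hypothetical, and the symbols are assumed to be drawn independently, which is what the product form of $P(x_i = 1)$ in the proof implies.

```python
from itertools import combinations

def expected_boundaries(probs, length):
    """Eq. (6): 2 * (l - 1) * sum over j < k of p_j * p_k."""
    pair_sum = sum(pj * pk for pj, pk in combinations(probs, 2))
    return 2 * (length - 1) * pair_sum
```

With uniform probabilities, which maximize the pairwise sum, the function gives $(l-1)/2$ for two symbols and $2(l-1)/3$ for three symbols, matching the values quoted above.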

We further optimize the local maxima finding procedure
by initially blocking the string $str$, as described
in [1], and then searching for the local maxima. This
makes the procedure faster and more concise. A contiguous
sequence of the same symbol is considered to be a block,
and is replaced by a single instance of that symbol repre- 
senting the block. If a symbol is selected, the entire block 
associated with it is considered to be selected. 

The next lemma states that if the inclusion of the symbol
representing a block increases the $\chi^2$ value, then the
inclusion of the entire block will further increase the $\chi^2$
value. This has been proved in Lemma 3.2.5 and Corollary
3.2.6 on pages 35-37 of [1]. For completeness, we
include a sketch of the proof in this paper.

Lemma 4. If the insertion of a symbol of a block increases
the chi-squared value of the current substring,
then the chi-squared value will be maximized if the entire
block is inserted.

Proof. Let the current substring be $sub$ and the adjacent
block of length $n$ be composed of symbol $\sigma_e \in \Sigma$. Appending
one $\sigma_e$ to $sub$ increases the $\chi^2$ value of the new
substring, where $\theta_{e,sub+1} = \theta_{e,sub} + 1$.

Given,

$$\sum_{i \neq e} \frac{(p_i (l_{sub} + 1) - \theta_{i,sub})^2}{p_i (l_{sub} + 1)} + \frac{(p_e (l_{sub} + 1) - \theta_{e,sub+1})^2}{p_e (l_{sub} + 1)} \geq \sum_{i \neq e} \frac{(p_i l_{sub} - \theta_{i,sub})^2}{p_i l_{sub}} + \frac{(p_e l_{sub} - \theta_{e,sub})^2}{p_e l_{sub}}$$

Or, $\chi^2_{sub+1} \geq \chi^2_{sub}$.

By simple algebraic manipulation, we can show that

$$\chi^2_{sub+n} \geq \chi^2_{sub+n-1} \geq \cdots \geq \chi^2_{sub+2} \geq \chi^2_{sub+1}$$

Hence, by including the entire block the $\chi^2$ value of the
substring will be maximized. □

The entire string is now blocked, and the local maxima
finding procedure works not with the original $str$ but with
$aba$, where the first $a$ represents the four contiguous $a$'s
in $str$, the $b$ represents the next three $b$'s, and the final $a$
stands for the last occurrence of $a$. The local maxima thus
found are $a$, $b$ and $a$. The positions of the local maxima
are 1, 5 and 8 respectively, according to their position in
the original string.
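Blocking amounts to a run-length encoding of the string. A minimal Python sketch is given below; the function name `block_string` and the tuple layout (symbol, 1-based start position, run length) are assumptions chosen for this illustration and are not taken from [1].

```python
from itertools import groupby

def block_string(s):
    """Collapse each run of identical symbols into one block.

    Returns a list of (symbol, start position, run length) tuples,
    with start positions 1-based relative to the original string.
    """
    blocks, pos = [], 1
    for sym, run in groupby(s):
        n = len(list(run))
        blocks.append((sym, pos, n))
        pos += n
    return blocks
```

For the running example, `block_string("aaaabbba")` yields three blocks for `a`, `b` and `a`, starting at positions 1, 5 and 8, which are the positions quoted above.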

Given the position of component y for each local max- 
ima of a string, we need to extract the global maxima, 
which we formally define as follows. 

Definition 2 (Global maxima). Global maxima is the sub- 
string having the maximum chi-square value, and is the 
substring that we are interested in extracting, i.e., the out- 
put substring. 

The global maxima has the maximum score among all 
possible substrings present in the input string. 



3 Algorithms 

Based on the observations, lemmas and local maxima ex- 
tracting procedure discussed previously, in this section we 
explain the All-Pair Refined Local Maxima (ARLM) and 
Approximate Greedy Maximum Maxima (AGMM) search 
algorithms for mining the most significant substring based 
on the chi-square value. 

3.1 All-Pair Refined Local Maxima Search 
Algorithm (ARLM) 

Given a string $str$ of length $l$ and composed of symbols
from the alphabet set $\Sigma$, we first extract all the local maxima
present in it in linear time, as described earlier. We
also optimize the local maxima finding procedure by
incorporating the idea of the blocking algorithm. With $str$
partitioned into its local maxima, the global maxima can
either start from the beginning of a local maxima or from
a position within it. Thus, it can contain an entire local
maxima, a suffix of it, or itself be a substring of a local
maxima. It is thus intuitive that the global maxima should
begin at a position such that the subsequent sequence of
characters offers the maximum chi-square value. Otherwise,
we could keep adding symbols to, or deleting symbols from,
the front of such a substring and still be able to
increase its $\chi^2$ value. Based on this, the ARLM heuristic
finds within each local maxima the suffix having the maximum
chi-square value, and considers the position of the
suffix as a potential starting point for the global maxima.

Let $xyz$ be a local maxima, where $x$ is a prefix of length
$l_x$, $y$ is a single symbol at position $pos$, and $z$ is the
remaining suffix having length $l_z$. Categorizing the components
$x$, $y$ and $z$ of a local maxima appropriately
is crucial for finding the global maxima.
Let start_pos and end_pos be two lists which are initially
empty and will contain the position of component
$y$, i.e., $pos$, for each of the local maxima. For a local
maxima, the chi-square value of all its suffixes is computed.
The starting position of the suffix having the maximum
chi-square value provides the position $pos$ for the
component $y$, i.e., $yz$ will be the suffix of $xyz$ having the
maximum chi-square value. The position $pos$ is inserted
into the list start_pos. If no such proper suffix exists for
the local maxima, the starting position of the local maxima
$xyz$ relative to the original string is inserted in the
list. After populating the start_pos list with the position
entries of $y$ for each of the local maxima of the input string,
the list contains the prospective positions from where the
global maxima may start.

The string $str$ is now reversed and the same algorithm
is re-run. This time, the end_pos list is similarly filled
with positions $y'$ relative to the beginning of the string.

For simplicity and efficiency of operations, we maintain
a table, symbol_count, having $m$ rows and $l$ columns,
where $m$ is the cardinality of the alphabet set. Each row
of the table holds, for its associated symbol, the number of
occurrences observed in the prefix of the string ending at the
position denoted by the column. The observed count of a symbol
between two given positions of the string is then the difference
of two table entries, and can thus be found in $O(1)$ time.
The space required in this case
becomes $O(lm)$. However, the table reduces the number
of accesses of the original string for computing the
maximum suffix within each local maxima. It also helps
to generalize the algorithm to streaming environments,
where it is not possible to store the entire string.
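The table described above is a per-symbol prefix-count array. The sketch below shows one way it could be laid out in Python; the function names are hypothetical, and the extra leading column of zeros (giving $l + 1$ columns rather than $l$) is a convenience assumed here so that ranges starting at position 1 need no special case.

```python
def build_symbol_count(s, alphabet):
    """table[sym][j] = occurrences of sym among the first j symbols of s."""
    table = {sym: [0] * (len(s) + 1) for sym in alphabet}
    for j, ch in enumerate(s, start=1):
        for sym in alphabet:
            table[sym][j] = table[sym][j - 1] + (1 if sym == ch else 0)
    return table

def count_between(table, sym, g, h):
    """Occurrences of sym in positions g..h (1-based, inclusive), in O(1)."""
    return table[sym][h] - table[sym][g - 1]

def chi_square_range(table, probs, g, h):
    """Chi-square value of the substring at positions g..h, in O(m) time."""
    n = h - g + 1
    return sum((p * n - count_between(table, sym, g, h)) ** 2 / (p * n)
               for sym, p in probs.items())
```

With the table in place, the chi-square value of any substring depends only on its two end positions, so the original string does not need to be rescanned for each candidate.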
Algorithm 2 ARLM Algorithm

Input: String $S$ with the probability of occurrence of
each symbol in the alphabet set.
Output: Top-k substrings with the maximum $\chi^2$ value.

1: Find all the local maxima in $S$ and $S$ reversed.
2: start_pos ← positions of the suffixes with maximum $\chi^2$
value in each local maxima of $S$.
3: end_pos ← positions of the suffixes with maximum $\chi^2$
value in each local maxima of $S$ reversed.
4: Based on the $\chi^2$ value, return the top-k substrings
formed from all pairs of positions from the two lists.



Given the two non-empty start_pos and end_pos lists,
we now find the chi-square value of the substrings from position
$g \in$ start_pos to $h \in$ end_pos, with $g \leq h$. The
substring having the maximum value is reported as the
global maxima. While computing the chi-square values
for all the pairs of positions in the two lists, the top-k substrings
can be maintained using a heap of $k$ elements (see
Algorithm 2 for the pseudo-code).

Continuing with our example of $str = aaaabbba$, the
start_pos list contains 1, 5 and 8, as none of the local maxima
contains a proper suffix with a chi-square value
greater than its own. Computing on $str$ reversed,
the end_pos list will contain 8, 7 and 4. We
now consider the substrings formed by the pairs (1,8),
(1,7), (1,4), (5,8), (5,7), and (8,8). Calculating all the chi-square
values and comparing them, we find that (1,4) has
the maximum value and is reported as the global maxima,
which is $aaaa$. Taking $k = 2$, we find that the substring
$aaaabbba$ corresponding to (1,8) provides the second
highest chi-square value.
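The steps of ARLM on this example can be traced with the following Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, chi-square values are recomputed from string slices instead of the count table, ties between suffixes are resolved by keeping only the longest one, and the input is assumed to be non-empty.

```python
import heapq
from collections import Counter

def chi_square(s, probs):
    """Chi-square value of a non-empty string s."""
    n, counts = len(s), Counter(s)
    return sum((p * n - counts[sym]) ** 2 / (p * n)
               for sym, p in probs.items())

def local_maxima_spans(s, probs):
    """0-based, half-open (start, end) spans of the local maxima of s."""
    spans, start = [], 0
    for i in range(1, len(s)):
        if chi_square(s[start:i + 1], probs) < chi_square(s[start:i], probs):
            spans.append((start, i))
            start = i
    spans.append((start, len(s)))
    return spans

def best_suffix_starts(s, probs):
    """1-based start of the highest-scoring suffix of each local maxima."""
    starts = []
    for lo, hi in local_maxima_spans(s, probs):
        best = max(range(lo, hi), key=lambda j: chi_square(s[j:hi], probs))
        starts.append(best + 1)
    return starts

def arlm_top_k(s, probs, k):
    """Score all (start, end) pairs with start <= end; return the k best."""
    start_pos = best_suffix_starts(s, probs)
    # A position r in the reversed string is position l + 1 - r in s.
    end_pos = [len(s) + 1 - r for r in best_suffix_starts(s[::-1], probs)]
    scored = [(chi_square(s[g - 1:h], probs), g, h)
              for g in start_pos for h in end_pos if g <= h]
    return heapq.nlargest(k, scored)
```

On `"aaaabbba"` with probabilities 0.2 and 0.8, the sketch produces the start positions 1, 5, 8 and the end positions 8, 7, 4, and its two best pairs are (1,4) and (1,8), in agreement with the worked example.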

3.2 Analysis of ARLM 

Conjecture 1. The starting position of the global maxima
is always present in the start_pos list.

Corollary 1. From the above conjecture, it follows that
the ending position of the global maxima is also present
in the end_pos list.

Proof. This directly follows from the commutative property
stated in Section 2. □

Finding all the local maxima in the string requires a single
pass, which takes $O(l)$ time for a string of length $l$. Let
the number of local maxima in the string be $d$. Finding the
maximum valued suffix for each local maxima using the
symbol_count table requires another pass over each of the
local maxima, and thus also takes $O(l)$ time. Since each
local maxima contributes one position to each list, the
number of elements in both the lists is $d$. In the rare case
that a local maxima contains two or more suffixes with
the same maximum $\chi^2$ value greater than that of the local
maxima, we store all such positions in the corresponding
list. Thus, the lists are of $O(d)$ length. We then evaluate
the substrings formed by each possible pair of start and
end positions, which takes $O(d^2)$ time. So in total, the time
complexity of the algorithm becomes $O(l + d^2)$.

We justified that although $d$ is $O(l)$, the
number of local maxima found in practice is far less than that (supported
by the empirical values shown in Section 4). So although the
theoretical running time degenerates to $O(l^2)$, in practice
it is found to be much better. The following optimization
further reduces the running time of the algorithm. We
evaluate the chi-square values only when the substrings
are properly formed from the two lists, i.e., for a given
pair of start and end positions obtained from the two lists,
the ending position is greater than or equal to the starting
position. This further reduces the actual running time
compared to that given by $O(d^2)$. We empirically
show that the running time is actually 3-4 times less
than that of the naive algorithm, which computes and compares
the values of all the possible substrings present in the
original string.

3.3 Approximate Greedy Maximum Maxima
Search Algorithm (AGMM)

In this section, we propose a greedy algorithm
for finding the maximum substring that is linear in the
size of the input string $str$. We extract all the local maxima
of the input string and its reverse, and populate the
start_pos and end_pos lists as discussed previously. We
identify the local maxima suffix $max$ having the maximum
chi-square value among all the local maxima present
in the string. AGMM assumes this local maxima suffix
to be completely present within the global maxima. We
then find a position $g \in$ start_pos for which the new substring
starting at $g$ and ending with $max$ as a suffix has
the maximum $\chi^2$ value over all $g$. Using this reconstructed
substring, we find a position $h \in$ end_pos such that the
new string starting at the selected position $g$ and ending
at position $h$ has the maximum chi-square measure over all
positions $h$. This new substring is reported by the algorithm
as the global maxima.

Again, using the example of $str = aaaabbba$, we find
$max = aaaa$ and, using the two lists, $aaaa$ is returned
as the global maxima. For $k = 2$, the heuristic returns
$aaaabbba$ as the second most significant substring.
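The greedy selection can be sketched in Python as follows. This is an illustrative reading of the description above rather than the authors' code: the function names are hypothetical, chi-square values are recomputed from slices instead of the count table, candidate start positions are restricted to those at or before the start of $max$, and candidate end positions to those at or after its end. The last two restrictions are assumptions made here so that $max$ stays inside every candidate.

```python
import heapq
from collections import Counter

def chi_square(s, probs):
    """Chi-square value of a non-empty string s."""
    n, counts = len(s), Counter(s)
    return sum((p * n - counts[sym]) ** 2 / (p * n)
               for sym, p in probs.items())

def best_suffixes(s, probs):
    """(score, start, end) of the best suffix of each local maxima, 1-based."""
    out, lo = [], 0
    for i in range(1, len(s) + 1):
        if i == len(s) or (chi_square(s[lo:i + 1], probs)
                           < chi_square(s[lo:i], probs)):
            j = max(range(lo, i), key=lambda t: chi_square(s[t:i], probs))
            out.append((chi_square(s[j:i], probs), j + 1, i))
            lo = i
    return out

def agmm_top_k(s, probs, k):
    forward = best_suffixes(s, probs)
    start_pos = [g for _, g, _ in forward]
    end_pos = [len(s) + 1 - g for _, g, _ in best_suffixes(s[::-1], probs)]
    _, m_start, m_end = max(forward)  # the suffix called max in the text

    def score(g, h):
        return chi_square(s[g - 1:h], probs)

    # Step 1: candidates that start at some g and end with max as a suffix.
    seen = {(g, m_end): score(g, m_end) for g in start_pos if g <= m_start}
    g_best = max(seen, key=seen.get)[0]
    # Step 2: extend the chosen start to each admissible end position.
    for h in end_pos:
        if h >= m_end:
            seen[(g_best, h)] = score(g_best, h)
    return heapq.nlargest(k, ((v, g, h) for (g, h), v in seen.items()))
```

All candidates scored along the way are retained, so the same pass can answer top-k queries; on the running example the two best candidates are (1,4) and (1,8).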

Using the symbol_count table, AGMM takes $O(d)$
time, where $d$ is the number of local maxima found. The
Algorithm 3 AGMM Algorithm

Input: String $S$ with start_pos and end_pos lists.
Output: Top-k statistically most significant substrings.

1: max ← suffix having the maximum $\chi^2$ value.
2: G ← strings starting at positions from start_pos with
max as suffix.
3: H ← strings in G ending at positions from end_pos.
4: Return the top-k strings of H based on $\chi^2$ value.



total running time of the algorithm is $O(d + l)$. However,
the substring returned may not be the actual global maxima
at all times. The intuition is that the global maxima
will contain the maximum of the local maxima to maximize
its value. This assumption is justified by the empirical
results in Section 4, which show that we almost always
obtain an approximation ratio of 0.96 or more, if not the
exact value. Being a linear time algorithm, it improves the
runtime by an order of magnitude compared to the other
algorithms. While finding the values of $g$ and $h$, we can
keep track of the chi-squared values of all the strings
thus formed. Using these values, the heuristic can be
used to output the top-k substrings (see Algorithm 3 for
the pseudo-code).



4 Experiments 

To assess the performance of the two proposed heuristics,
ARLM and AGMM, we conduct tests on multiple datasets
and compare them with the results of the naive algorithm and
the blocking algorithm [1]. The heap variant of the blocking
algorithm is not efficient, as it has a higher running
time complexity and uses more memory, and hence has
not been included in the comparison. The accuracy of the results
returned by the heuristics is compared with that returned by
the naive algorithm, which provides the optimal answer.

We used two real datasets: (i) the innings-by-innings
runs scored by Sachin Tendulkar in one-day internationals
(ODI)¹ and (ii) the number of user clicks on the front
page of msnbc.com². We have also used synthetic data
to assess the scalability and practicality of the heuristics.
We compare the results based on the following parameters:
(i) search time for top-k queries, (ii) number of local
maxima found, and (iii) accuracy of the result, measured as
the ratio of the $\chi^2$ value returned by the heuristics to the
optimal value obtained from the naive algorithm. The experiments
were conducted on a 2.1 GHz desktop PC with 2


# innings   Total runs   Avg.    #100   #50   #0
425         17178        44.50   45     91    20

Table 1: Sachin Tendulkar's ODI career statistics (as of
November 2009).



Form          Date                       Avg.    Runs scored
Best patch    22/04/1998 to 13/11/1998   84.31   143, 134, 33, 18, 100*, 65, 53, 17, 128, 77,
                                                 127*, 29, 2, 141, 8, 3, 118*, 18, 11, 124*
Worst patch   15/03/1992 to 19/12/1992   21.89   14, 39, 15, 10, 22, 21, 32, 23, 21

Table 2: Result from Sachin's records.

¹ http://stats.cricinfo.com/ci/engine/player/35320.html?class=2;template=results;type=batting;view=innings

² http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data



GB of memory, using C++ in a Linux environment.

4.1 Real Datasets 

Table 1 summarizes the statistics of Sachin Tendulkar's
ODI career to date. The innings in which he did not bat
were not considered. Given his runs, we quantized the
runs scored into 5 symbols as follows: 0-9 is represented
by A (Poor), 10-24 by B (Bad), 25-49 by C (Fair), 50-99
by D (Good) and 100+ by E (Excellent). His innings-wise
runs were categorized, and from the entire data we
calculated the actual probabilities of occurrence of the different
symbols, which were 0.28, 0.18, 0.22, 0.22 and
0.10 respectively for the five symbols. With this setting,
we extracted the top-k substrings with the maximum chi-square
value. These results reflect the periods of his career
when he was in top form or when there was a bad patch,
since in both cases his performance would deviate from
the expected. Table 2 summarizes the findings. We find
that during his best patch he scored 8 centuries and
3 half-centuries in 20 innings with an average of 84.31,
while in the off-form period he had an average of 21.89 in
9 innings without a score above 40.

Figure 1 and Figure 2 plot the times taken by the different
algorithms and the approximation ratio (accuracy)
of the results of the heuristics, respectively, while varying the
value of k in the top-k queries. The ARLM algorithm takes less
running time than the other procedures, while
the AGMM method, being a linear time algorithm, is very
fast. The accuracy of the ARLM heuristic is found to be
100% for the top-1 query, i.e., it provides the correct result,
validating the conjecture we proposed in Section 3.
As the value of k increases we find an increase in the ap- 
Figure 1: Time for finding the top-k query in Sachin's run
dataset.



Algorithm   Searching time   Approx. ratio
Naive       75+ hrs          1
Blocking    52 hrs           1
ARLM        40 hrs           1
AGMM        3 hrs            1

Table 3: Results for the dataset (containing 989819 records)
of number of user clicks.



Dataset 


# Blocks 


# Local maxima 


Sachin 


319 


281 


Web clicks 


835142 


759921 



Table 4: Number of blocks versus local maxima for real 
datasets. 



Sachin's dataset (I = 425, m = 5) 




0.9 1 1 1 1 1 1 1 1 1 1 1 

5 10 15 20 25 30 35 40 45 50 

Value of k for top-k search 



Figure 2: Approximation ratio of the top-k query in 
Sachin's run dataset. 



proximation ratio of both the heuristics as the number of 
pairs of local maxima involved increases, giving better re- 
sults. The number of local maxima found is lesser than the 
number of blocks constructed by the blocking algorithm 
(see Table @). So, the heuristic prunes the search space 
more efficiently. 

The second real dataset that we considered contained
the number of user clicks encountered on the front page
of the website msnbc.com during various periods of a
day, taken from a sample of 989819 users. Analysis of
the clicks from a group of users provides an insight into
potential clients for the organization or customers for
e-commerce purposes. The number of clicks has been
categorized as follows: 1-3 clicks are represented by
A (Low), 4-9 clicks by B (Medium) and 10+ clicks by
C (High). We quantized the dataset accordingly and then
performed the experiments by calculating the actual
probabilities of occurrence of the different symbols,
which were 0.43, 0.36 and 0.21 respectively. Table [3]
describes the data values and tabulates the result for the
top-1 query. Due to the time-consuming nature of processing
this dataset, we did not search for the top-k queries with
varying values of k. The results show that the ARLM
technique has a better running time than the others, and
also operates on a smaller number of local maxima than the
number of blocks used by the blocking algorithm (see
Table [4]). The approximation factor for both the
heuristics is 1 for the top-1 search, thereby yielding the
correct result.
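
To see why exhaustive search is so expensive on a string of nearly a million symbols, the following is a minimal sketch of a naive top-k search that scores every substring. It is an illustrative baseline under stated assumptions, not the implementation benchmarked here: it assumes the standard Pearson chi-square, uses prefix counts so that each substring is scored in O(m) time, and the function name `naive_top_k` is hypothetical.

```python
import heapq


def naive_top_k(string, probs, k=1):
    """Score every substring and return the k best as (chi2, start, end).

    `end` is exclusive. Assumes every probability in `probs` is positive.
    The double loop visits O(l^2) substrings for a string of length l.
    """
    symbols = list(probs)
    index = {s: j for j, s in enumerate(symbols)}

    # prefix[i][j] = number of occurrences of symbols[j] in string[:i]
    prefix = [[0] * len(symbols)]
    for ch in string:
        row = prefix[-1][:]
        row[index[ch]] += 1
        prefix.append(row)

    heap = []  # min-heap holding at most k candidates
    n = len(string)
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            score = 0.0
            for j, s in enumerate(symbols):
                expected = length * probs[s]
                observed = prefix[end][j] - prefix[start][j]
                score += (observed - expected) ** 2 / expected
            item = (score, start, end)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


# Symbol probabilities reported in the text for the click dataset.
CLICK_PROBS = {"A": 0.43, "B": 0.36, "C": 0.21}
best = naive_top_k("ABACCCCAB", CLICK_PROBS, k=1)[0]
```

On the toy string, the best substring is the run of four C symbols at positions 3 to 6. The quadratic number of candidate substrings is what the blocking and local-maxima approaches aim to prune.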

4.2 Synthetic datasets 

We now benchmark the ARLM and AGMM heuristics
against datasets generated randomly using a uniform
distribution. To simulate the deviations from the expected
characteristics as observed in real applications, we
perturb the random data thus generated with chunks of data
generated from a geometric distribution with parameter
p = 0.3. These strings are then mined to extract the top-k
substrings with the largest chi-square values. The
parameters that affect the performance of the heuristics
are: (i) the length of the input string, l, (ii) the size
of the alphabet set, m, and (iii) the value of k in the
top-k query. For different values of these parameters we
compare our algorithms with the existing ones on the basis
of (a) the time to search, (b) the approximation ratio of
the results, and (c) the number of blocks evaluated by the
blocking algorithm versus the number of local maxima found
by our algorithm.
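
The data generation described above could be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: the number and length of the perturbed chunks, the way geometric values are mapped onto the alphabet (here the tail is folded into the last symbol), and the names `make_synthetic` and `geometric_symbol`.

```python
import random


def geometric_symbol(rng, m, p=0.3):
    """Draw from Geometric(p) on {1, 2, ...} and map it to a symbol index.

    Values larger than m are folded into the last symbol (an assumption).
    """
    value = 1
    while rng.random() >= p:
        value += 1
    return min(value, m) - 1


def make_synthetic(length, m, n_chunks=3, chunk_len=50, p=0.3, seed=0):
    """Uniform random string over m symbols with geometric chunks overwritten.

    Requires length >= chunk_len. Symbols are integers in range(m).
    """
    rng = random.Random(seed)
    data = [rng.randrange(m) for _ in range(length)]
    for _ in range(n_chunks):
        start = rng.randrange(0, length - chunk_len + 1)
        for i in range(start, start + chunk_len):
            data[i] = geometric_symbol(rng, m, p)
    return data


data = make_synthetic(1000, 5)
```

Fixing the seed keeps a run reproducible, which matters when comparing the search time and approximation ratio of several algorithms on the same string.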

4.3 Effect of parameters 

Figure [3] shows that the time taken for searching the
top-k queries increases with the length of the input
string l. The number of blocks or local maxima increases
with the size of the string, and hence the computation
time also increases. The time increases roughly
quadratically for ARLM and the other existing algorithms,
in accordance with the analysis in Section [3.2]. ARLM
takes less running time than the other techniques, as the
number of local maxima found is smaller than the number
of blocks found by the blocking algorithm (see Table [5]).
Hence, it prunes the search space better and is faster.
On the other hand, AGMM, being a linear-time heuristic,
runs an order of magnitude faster than the others. We
also find that the accuracy of the top-k results reported
by AGMM improves with the increase in the string length
(see Figure [7]), as the deviation of substrings becomes
more prominent with respect to the large portions of the
string depicting expected behavior. The approximation
factor for ARLM is 1 for the top-1 query in all the cases
tested, while for other top-k queries and for AGMM it is
always above 0.96.

Parameters         Variable    # Blocks    # Local maxima
m = 5, k = 1       l = 10^3    831         742
                   l = 10^4    7821        6740
                   l = 10^5    77869       66771
l = 10^4, k = 1    m = 5       7821        6740
                   m = 25      8104        7203
                   m = 50      8704        7993

Table 5: Results for uniform dataset.

[Plot: Random dataset (m = 5, k = 1). Curves: Naive,
Blocking, ARLM, AGMM. x-axis: length of the string (l)
(x10^3), 10 to 100; y-axis: search time, ticks 500 to
2500.]

Figure 3: Effect of length on search time.

[Plot: Random datasets. x-axis: size of alphabet set (m),
10 to 100; y-axis: search time.]

Figure 4: Effect of the size of the alphabet set on search
time.

[Plot: Random dataset (m = 5; the exponent of l = 10^... is
missing in the source). x-axis: value of k for top-k
query, 10 to 50; y-axis: search time.]

Figure 5: Effect of value of k for top-k query on search
time.

Varying the size of the alphabet set m, we find that
the time taken for searching the top-k query as well
as the number of blocks formed increases (Table [5]
and Figure [4]). As m increases, the number of blocks
increases, since the probability of the same symbol
occurring contiguously falls off. We observed in
Section [2] that a local maximum can only end at positions
containing adjacent dissimilar symbols. So the number of
local maxima found increases, thereby increasing the
computation time of the algorithms. There seems to be no
appreciable effect of m on the approximation ratio of the
results returned by the algorithms. We tested with varying
values of m with l = 10^4 and k = 2, and found the ratio
to be 1 in all cases. Figure [6] shows the effect of
varying the probability of occurrence of one of the
symbols in a string composed of two symbols only. The
approximation ratio remained 1 for both heuristics for the
top-1 query.
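
The dependence of the block count on m can be illustrated with a short sketch. It assumes that a block is a maximal run of identical adjacent symbols, which is how the argument above reads; the function names are hypothetical. For a purely uniform i.i.d. string, adjacent symbols differ with probability 1 - 1/m, so the expected number of such runs is 1 + (l - 1)(1 - 1/m).

```python
from itertools import groupby


def count_blocks(symbols):
    """Number of maximal runs of identical adjacent symbols."""
    return sum(1 for _ in groupby(symbols))


def expected_blocks_uniform(length, m):
    """Expected run count for i.i.d. uniform symbols over m letters."""
    return 1 + (length - 1) * (1 - 1 / m)


example_blocks = count_blocks("AABBBCA")  # runs: AA, BBB, C, A
```

For l = 10^4 the formula gives roughly 8000, 9600 and 9800 runs for m = 5, 25 and 50. The counts in Table [5] come from strings perturbed with geometric chunks, so they need not match these unperturbed values, but they show the same upward trend with m.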

We next show the scalability of our algorithms by
conducting experiments for varying values of k for top-k
substrings. Figure [5] shows that search time increases
with the value of k. This is expected, as we are required
to perform more computations. The accuracy of the results
for the heuristics increases with k. For k = 2, it is
0.96, and it increases up to 1 when k becomes more than
10. The number of blocks or local maxima found remains
unchanged with the variation of k.

[Plot: Random dataset (m = 2, k = 1; the value of l is not
legible in the source). Curves: Blocking, ARLM, AGMM.
x-axis: probability of occurrence of one symbol, 0.1 to
0.5; y-axis: search time, ticks 10 to 20.]

Figure 6: Effect of probability in two symbol string on
search time.

[Plot: Random dataset (l = 10^4, m = 5). x-axis: value of
k for top-k query, 10 to 50; y-axis: approximation ratio,
0.95 to 0.99.]

Figure 7: Approximation ratio of the top-k query.



5 Conclusions 

In this paper, we have proposed two heuristics for search- 
ing a given string for the top-k substrings having the max- 
imum chi-square value representing its deviation from the 
expected nature, with the possibility of hidden pattern or 
information. We described how the chi-square measure 
closely approximates p-value and is apt for mining such 
substrings. We provided a set of observations based on
which we developed two heuristics, one that runs in
time quadratic in the number of local maxima, and
another that runs in linear time. Our experiments showed that the
proposed heuristics are faster than the existing algorithms. 
The algorithms return results that have an approximation 
ratio of more than 0.96. 